Finding meaningful patterns within data has become increasingly challenging as data collection and management continue to grow at an unprecedented rate.
By the end of this presentation we will have discussed the following concepts of k-means clustering:
# A tibble: 10 x 8
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
<chr> <chr> <chr> <dbl> <chr> <dbl> <dbl>
1 536365 85123A WHITE HANGING ~ 6 12/1/2010 ~ 2.55 17850
2 536365 71053 WHITE METAL LA~ 6 12/1/2010 ~ 3.39 17850
3 536365 84406B CREAM CUPID HE~ 8 12/1/2010 ~ 2.75 17850
4 536365 84029G KNITTED UNION ~ 6 12/1/2010 ~ 3.39 17850
5 536365 84029E RED WOOLLY HOT~ 6 12/1/2010 ~ 3.39 17850
6 536365 22752 SET 7 BABUSHKA~ 2 12/1/2010 ~ 7.65 17850
7 536365 21730 GLASS STAR FRO~ 6 12/1/2010 ~ 4.25 17850
8 536366 22633 HAND WARMER UN~ 6 12/1/2010 ~ 1.85 17850
9 536366 22632 HAND WARMER RE~ 6 12/1/2010 ~ 1.85 17850
10 536367 84879 ASSORTED COLOU~ 32 12/1/2010 ~ 1.69 13047
# ℹ 1 more variable: Country <chr>
# A tibble: 6 x 4
CustomerID Sales Orders AvgSale
<dbl> <dbl> <int> <dbl>
1 12347 4310 7 616.
2 12348 1797. 4 449.
3 12349 1758. 1 1758.
4 12350 334. 1 334.
5 12352 2506. 8 313.
6 12353 89 1 89
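A per-customer summary like the one above can be derived from the raw invoice lines with dplyr. This is a minimal sketch, assuming a data frame `retail` with the columns shown in the first table; the toy rows here are placeholders, not the actual dataset.

```r
# Hypothetical sketch: aggregating invoice lines into per-customer metrics.
library(dplyr)

# Placeholder data standing in for the retail transactions shown above
retail <- data.frame(
  CustomerID = c(12347, 12347, 12348),
  InvoiceNo  = c("536370", "536371", "536372"),
  Quantity   = c(6, 4, 2),
  UnitPrice  = c(2.55, 3.39, 1.85)
)

customer_summary <- retail %>%
  mutate(LineTotal = Quantity * UnitPrice) %>%  # revenue per invoice line
  group_by(CustomerID) %>%
  summarise(
    Sales   = sum(LineTotal),        # total revenue per customer
    Orders  = n_distinct(InvoiceNo), # number of distinct invoices
    AvgSale = Sales / Orders         # average revenue per order
  )
```

Grouping by `CustomerID` and counting distinct `InvoiceNo` values reproduces the `Sales`, `Orders`, and `AvgSale` columns of the summary table.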
There are five main steps to executing the k-means clustering method.
Clustering is the act of partitioning data into meaningful groups based on similarity of attributes
The goal of clustering is to create insightful clusters to better understand connections in the data
\[ d(x, C_i) = \sqrt{\sum_{j=1}^{N} (x_j - C_{ij})^2} \]
Objective Function:
It is formulated as:
\[ J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 \]
\(k\) is the number of clusters
\(C_i\) represents the set of points assigned to cluster \(i\)
\(\mu_i\) represents the centroid (mean) of cluster \(i\)
In this context, similarity is inversely related to the Euclidean distance
The smaller the distance, the greater the similarity between objects
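The Euclidean distance behind this similarity measure is a one-line function in R. This is a small illustrative sketch; the helper name `euclid` is an assumption, not part of the presentation's code.

```r
# Sketch: Euclidean distance between a point x and a centroid C,
# matching the distance formula above.
euclid <- function(x, C) sqrt(sum((x - C)^2))

euclid(c(0, 0), c(3, 4))  # the classic 3-4-5 triangle: distance 5
```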
K-means clustering reassigns each data point to its nearest cluster based on the Euclidean distance calculation
A new centroid location is then set by updating each cluster's mean center
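These two steps, reassignment by distance and the centroid update, can be sketched as a single iteration in R. The toy points and starting centroids below are assumptions chosen only to make the mechanics visible.

```r
# Sketch of one k-means iteration: assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points.
X <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))  # toy 2-D points
centroids <- rbind(c(0, 0), c(5, 5))            # current centroids

# Step 1: reassignment by (squared) Euclidean distance
assign <- apply(X, 1, function(p) {
  which.min(colSums((t(centroids) - p)^2))
})

# Step 2: centroid update (mean of each cluster's points)
centroids <- do.call(rbind, lapply(seq_len(nrow(centroids)), function(i) {
  colMeans(X[assign == i, , drop = FALSE])
}))
centroids  # updated centroid positions
```

In practice the two steps repeat until assignments stop changing, which is exactly the loop `kmeans()` runs internally.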
library(ggplot2)
# Plot the WCSS values against the number of clusters
p1 <- ggplot(data.frame(K = 1:10, WCSS = wcss), aes(x = K, y = WCSS)) +
geom_line() +
geom_point() +
labs(title="Elbow Method to Find Optimal K", x="Number of Clusters (K)", y="Within-Cluster-Sum-of-Squares (WCSS)") +
scale_x_continuous(breaks = seq(0, 10, by = 1))

# A tibble: 4 x 6
Sales Orders AvgSale size withinss cluster
<dbl> <dbl> <dbl> <int> <dbl> <fct>
1 -0.684 -0.964 0.155 75 42.2 1
2 0.536 -0.500 1.21 73 55.8 2
3 0.969 1.07 0.311 145 148. 3
4 -1.03 -0.374 -1.16 125 120. 4
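The `wcss` vector fed into the elbow plot can be computed by fitting k-means for each candidate K and recording the total within-cluster sum of squares. This is a sketch under the assumption of a scaled numeric feature matrix; `customers_scaled` here is a random placeholder, not the actual customer data.

```r
# Sketch: computing the WCSS curve used in the elbow plot.
set.seed(123)
customers_scaled <- scale(matrix(rnorm(300), ncol = 3))  # placeholder features

# Fit k-means for K = 1..10 and record total within-cluster sum of squares
wcss <- sapply(1:10, function(k) {
  kmeans(customers_scaled, centers = k, nstart = 25)$tot.withinss
})
```

WCSS always decreases as K grows; the "elbow", where the decrease levels off, suggests a reasonable number of clusters.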
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
Challenges and Considerations -
Data Handling:
Managing large and noisy datasets
Robustness:
Ensuring robustness against outliers
Cluster Number Determination:
Defining an appropriate number of clusters
Research Focus -
Continued Exploration:
Ongoing refinement of clustering techniques and cluster selection process
Industry Evolution:
Adapting newer methods to meet evolving e-commerce demands